# import libraries
library(tidyr)
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(car)
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(knitr)
library(effects)
## Loading required package: carData
##
## Attaching package: 'carData'
## The following objects are masked from 'package:car':
##
## Guyer, UN, Vocab
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
library(HistData)
library(gvlma)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# import dataset
bs <- read.csv("hour.csv")
# verify dataset has been loaded properly
head(bs,3)
## instant dteday season yr mnth hr holiday weekday workingday
## 1 1 2011-01-01 1 0 1 0 0 6 0
## 2 2 2011-01-01 1 0 1 1 0 6 0
## 3 3 2011-01-01 1 0 1 2 0 6 0
## weathersit temp atemp hum windspeed casual registered cnt
## 1 1 0.24 0.2879 0.81 0 3 13 16
## 2 1 0.22 0.2727 0.80 0 8 32 40
## 3 1 0.22 0.2727 0.80 0 5 27 32
Source of dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Original source: https://www.capitalbikeshare.com/system-data
This dataset contains the hourly and daily record of bike rental counts between year 2011 and 2012 in Washington D.C., provided by Capital Bikeshare, a bike rental company. This dataset also aggregates the weather and the seasonal information particular for that day, including temperature and humidity.
# dataset shape
dim(bs)
## [1] 17379 17
There are 17379 records with 17 columns in this dataset.
# look at structure of dataframe
str(bs)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
The 17 variables are described in the dataset documentation linked above.
Let’s check if there are any missing values in this dataset.
# check for missing values
# source: https://stackoverflow.com/questions/8317231/elegant-way-to-report-missing-values-in-a-data-frame
sapply(bs, function(x) sum(is.na(x)))
## instant dteday season yr mnth hr
## 0 0 0 0 0 0
## holiday weekday workingday weathersit temp atemp
## 0 0 0 0 0 0
## hum windspeed casual registered cnt
## 0 0 0 0 0
There are no missing values for any of the columns in this dataset.
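An equivalent one-liner, if preferred, is colSums() over the logical matrix that is.na() returns; a minimal sketch on a toy data frame (not the bike data):

```r
# per-column NA counts via colSums(); same result as the sapply() call above
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
colSums(is.na(toy))
# a b
# 1 1
```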
There are several columns that need to be cleaned or dropped:
# drop instant column
bs <- bs %>%
dplyr::select(-instant)
# convert dteday to date time data type
# source: https://www.statmethods.net/input/dates.html
bs$dteday <- as.Date(bs$dteday)
# verify column data type has changed
str(bs)
## 'data.frame': 17379 obs. of 16 variables:
## $ dteday : Date, format: "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
tail(bs, 3)
## dteday season yr mnth hr holiday weekday workingday weathersit
## 17377 2012-12-31 1 1 12 21 0 1 1 1
## 17378 2012-12-31 1 1 12 22 0 1 1 1
## 17379 2012-12-31 1 1 12 23 0 1 1 1
## temp atemp hum windspeed casual registered cnt
## 17377 0.26 0.2576 0.60 0.1642 7 83 90
## 17378 0.26 0.2727 0.56 0.1343 13 48 61
## 17379 0.26 0.2727 0.65 0.1343 12 37 49
# change season back to original string value
bs$season = ifelse(bs$season == 1,"Winter",
ifelse(bs$season == 2, "Spring",
ifelse(bs$season == 3, "Summer", "Fall")))
# verify changes
table(bs$season)
##
## Fall Spring Summer Winter
## 4232 4409 4496 4242
# change weekday values back to original string value
bs$weekday = ifelse(bs$weekday == 1, "Mon",
ifelse(bs$weekday == 2, "Tues",
ifelse(bs$weekday == 3, "Wed",
ifelse(bs$weekday ==4, "Thu",
ifelse(bs$weekday == 5, "Fri",
ifelse(bs$weekday == 6, "Sat", "Sun"))))))
# verify changes
table(bs$weekday)
##
## Fri Mon Sat Sun Thu Tues Wed
## 2487 2479 2512 2502 2471 2453 2475
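A less error-prone alternative to the nested ifelse() chains above is mapping the integer codes through factor() levels and labels; a sketch on a toy vector, assuming the dataset's 0 = Sunday through 6 = Saturday coding:

```r
# map 0-6 weekday codes to names in one step (0 = Sun ... 6 = Sat)
wd <- c(6, 6, 0, 1)
as.character(factor(wd, levels = 0:6,
                    labels = c("Sun", "Mon", "Tues", "Wed",
                               "Thu", "Fri", "Sat")))
# "Sat" "Sat" "Sun" "Mon"
```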
# change normalized values to original temp values
bs <- bs %>%
  mutate(temp_original = temp * 47 - 8,
         atemp_original = atemp * 66 - 16,
         hum_original = hum * 100,
         windspeed_original = windspeed * 67)
# verify changes
bs %>%
select(temp_original,atemp_original,
hum_original, windspeed_original) %>%
head(.,3)
## temp_original atemp_original hum_original windspeed_original
## 1 3.28 3.0014 81 0
## 2 2.34 1.9982 80 0
## 3 2.34 1.9982 80 0
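As a sanity check on the de-normalization constants, applying the inverse transform should recover the normalized value. A sketch, assuming (per the dataset documentation) temp was min-max scaled from a -8 to +39 degree range:

```r
# round-trip: de-normalize then re-normalize temp
t_min <- -8; t_max <- 39
temp_norm <- 0.24
temp_orig <- temp_norm * (t_max - t_min) + t_min   # 0.24 * 47 - 8 = 3.28
back <- (temp_orig - t_min) / (t_max - t_min)
all.equal(back, temp_norm)   # TRUE
```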
# verify cnt = casual + registered
# 0 if correct, 1 if incorrect
cnt_ver <- ifelse(bs$cnt == (bs$registered + bs$casual), 0, 1)
# verify that sum of cnt_ver is 0
sum(cnt_ver)
## [1] 0
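The same invariant could also be asserted with stopifnot(), which raises an error if any row violates it; a sketch on toy columns:

```r
# errors out if cnt != casual + registered in any row; silent otherwise
toy <- data.frame(casual = c(3, 8, 5), registered = c(13, 32, 27),
                  cnt = c(16, 40, 32))
stopifnot(all(toy$cnt == toy$casual + toy$registered))
```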
Some column names are not intuitive, so it is better to rename them.
bs <- bs %>%
  plyr::rename(replace = c('dteday' = 'date',
                           'weathersit' = 'weather',
                           'cnt' = 'total_bikes'))  # plyr masks dplyr::rename here
# check dataframe
head(bs, 3)
## date season yr mnth hr holiday weekday workingday weather temp
## 1 2011-01-01 Winter 0 1 0 0 Sat 0 1 0.24
## 2 2011-01-01 Winter 0 1 1 0 Sat 0 1 0.22
## 3 2011-01-01 Winter 0 1 2 0 Sat 0 1 0.22
## atemp hum windspeed casual registered total_bikes temp_original
## 1 0.2879 0.81 0 3 13 16 3.28
## 2 0.2727 0.80 0 8 32 40 2.34
## 3 0.2727 0.80 0 5 27 32 2.34
## atemp_original hum_original windspeed_original
## 1 3.0014 81 0
## 2 1.9982 80 0
## 3 1.9982 80 0
The big question we want to answer using this dataset is how can we predict the number of bikes rented at a certain date and hour, together with other variables such as weather conditions or the type of user.
In order to build the prediction model, we would need to explore and examine not only individual variables, but also the relationship among multiple variables. The result of the analysis will allow us to choose the most appropriate variables to build a model that would help us predict the number of bikes that will be rented.
The rest of this report follows this structure:
4. Variable analysis
5. Statistical tests using variables
6. Regression analysis
In this section, we will look at the important variables of this dataset and examine the relationships among multiple variables.
names(bs)
## [1] "date" "season" "yr"
## [4] "mnth" "hr" "holiday"
## [7] "weekday" "workingday" "weather"
## [10] "temp" "atemp" "hum"
## [13] "windspeed" "casual" "registered"
## [16] "total_bikes" "temp_original" "atemp_original"
## [19] "hum_original" "windspeed_original"
There are two types of users of Capital Bikeshare: registered users who have a membership with the company, and casual users who borrow bikes for one-time use.
sum(bs$casual) / sum(bs$total_bikes)
## [1] 0.1883017
sum(bs$registered) / sum(bs$total_bikes)
## [1] 0.8116983
Around 81.2% of the total rentals were by registered users, and the rest by casual users. The dominance of registered users makes sense: casual riders may also turn to other companies or their own bikes, while registered members are tied to this service.
# distribution of total bikes per hour
summary(bs$total_bikes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 40.0 142.0 189.5 281.0 977.0
grid.arrange(
ggplot(bs, aes(casual)) +
geom_histogram(color = I('gray')) +
ggtitle("Bikes borrowed per hour by casual users") +
theme(plot.title = element_text(size = 10, face = "bold")),
ggplot(bs, aes(registered)) +
geom_histogram(color = I('gray')) +
ggtitle("Bikes borrowed per hour by registered users")+
theme(plot.title = element_text(size = 10, face = "bold")),
ggplot(bs, aes(x = total_bikes)) +
geom_histogram(color = I('gray')) +
geom_vline(xintercept = mean(bs$total_bikes), color = I('red'), linetype = 2) +
geom_vline(xintercept = median(bs$total_bikes), color = I('blue'), linetype = 2) +
ggtitle("Total number of bikes borrowed per hour")+
theme(plot.title = element_text(size = 10, face = "bold")),
ncol = 2
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For both types of users, the distribution is skewed to the right, which makes sense because it would be rare to have a very large number of users (say 700) borrowing at the same time. Because there are far more registered users, the distribution of total bikes resembles the registered-user count distribution more than the casual-user one. The mean number of rentals per hour is 189.5 (red), but the median is 142 (blue), confirming the right skew seen in the histogram.
Because registered users so heavily outnumber casual users, registered usage dominates the totals. To ensure that the casual users’ distinct usage patterns are not swamped by the registered users’, we will analyze the two user types separately in the subsequent analysis.
This dataset provides not only date data, but also hourly data, meaning that we can look at the bike usage pattern at different times of the day.
ggplot(bs, aes(x = hr, y = total_bikes)) +
geom_col() +
ggtitle("Total number of bikes rented by hour") +
scale_x_continuous(breaks = seq(0,23,1)) +
scale_y_continuous(breaks = seq(0,350000,50000)) +
theme(plot.title = element_text(size = 15, face = "bold"))
The number peaks at 9am and 5pm~6pm, which are the usual rush-hour commute times. Let’s look into this further and see whether the usage pattern is the same for both the registered and the casual users.
# hourly bike rent count for casual vs registered users
grid.arrange(
ggplot(bs, aes(x = hr, y = casual)) +
geom_col() +
ggtitle("Bikes rented at each hour by casual users") +
theme(plot.title = element_text(size = 10, face = "bold")),
ggplot(bs, aes(x = hr, y = registered)) +
geom_col() +
ggtitle("Bikes rented at each hour by registered users") +
theme(plot.title = element_text(size = 10, face = "bold")),
ncol = 2
)
It is clear that the hourly distribution of rentals differs between casual and registered users. This may be because the two groups have different purposes when borrowing a bike. The commute-time peaks seen in the total rentals are no longer visible for the casual users; instead there seems to be a single wide peak in the afternoon, from 12 to 5pm. These are working hours on a weekday, which raises another question: do casual and registered users ride primarily on different types of days (i.e., working days vs non-working days)?
First, let’s look at how many working days and non-working days (holidays and weekends) there are.
ggplot(bs, aes(x = factor(workingday))) +
geom_bar(aes(y = ..count../sum(..count..) * 100)) +
scale_y_continuous(breaks = seq(0,80,10)) +
ggtitle("Proportion of working and non-working days (%)") +
theme(plot.title = element_text(size = 15, face = "bold")) +
ylab("percentage (%)")
A little over 30% of the days are either weekends or holidays.
Now, let’s look at how the number of bikes borrowed at each hour by registered and casual users changes between working and non-working days.
grid.arrange(
ggplot(bs, aes(x = hr, y = registered)) +
geom_col() +
scale_x_continuous(breaks = seq(0,23,1)) +
facet_wrap(~factor(workingday)),
ggplot(bs, aes(x = hr, y = casual)) +
geom_col() +
scale_x_continuous(breaks = seq(0,23,1)) +
facet_wrap(~factor(workingday)),
top = "Number of bikes borrowed at each hour \nby user type for working days and non-working days"
)
We can see that while the working-day vs non-working-day factor has a huge impact on hourly rentals for registered users, the impact is less evident for casual users. In fact, judging from the graph, the number of bikes borrowed by casual users seems similar on working and non-working days.
grid.arrange(
ggplot(bs, aes(x = factor(workingday), y = casual)) +
geom_col() +
ggtitle("Bikes borrowed by casual users by type of day") +
xlab("Casual users") +
theme(plot.title = element_text(size = 10, face = "bold")),
ggplot(bs, aes(x = factor(workingday), y = registered)) +
geom_col() +
ggtitle("Bikes borrowed by registered users by type of day") +
xlab("Registered users") +
theme(plot.title = element_text(size = 10, face = "bold")),
ncol = 2
)
While the number of bikes rented on working and non-working days is indeed similar for casual users (slightly more on non-working days), registered users clearly ride predominantly on working days. This would explain the two commute-time peaks in the hourly distribution: many registered users use the bikes as a transportation method for commuting to and from work (or school). Since the usage patterns of the two groups are clearly different, we may need a separate model to predict the number of bikes rented for each type of user.
Since a significant number of registered users use the bike for transportation, we would expect their ridership not to vary much by season. By the same token, because about half of casual rides happen on non-working days (i.e., for leisure), casual ridership should vary more by season, as weather is likely a bigger consideration.
grid.arrange(
ggplot(bs, aes(x = factor(workingday), y = casual)) +
geom_col() +
ggtitle("Bikes rented by workingday \nand holiday for casual users") +
xlab("Casual users") +
facet_wrap(~season) +
theme(plot.title = element_text(size = 10, face = "bold")),
ggplot(bs, aes(x = factor(workingday), y = registered)) +
geom_col() +
ggtitle("Bikes rented by workingday \nand holiday for registered users") +
xlab("Registered users") +
facet_wrap(~season) +
theme(plot.title = element_text(size = 10, face = "bold")),
ncol = 2
)
The expectation seems to hold for the registered group (ridership does not vary much by season, except for a general drop in winter). For casual users, however, there also appears to be no big difference between working and non-working days within each season. A further hypothesis test should be conducted to see whether the working-day and season categories are independent for casual users.
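One such test is a chi-square test of independence on the season-by-workingday table of casual ride counts; a minimal sketch with made-up counts (on the real data the table would come from something like xtabs(casual ~ season + workingday, data = bs)):

```r
# toy 4x2 contingency table: rows = season, cols = workingday (0/1)
tab <- matrix(c(500, 700, 900, 600,
                450, 800, 950, 550), ncol = 2,
              dimnames = list(c("Winter", "Spring", "Summer", "Fall"),
                              c("0", "1")))
res <- chisq.test(tab)
res$p.value   # compare against 0.05 to decide on independence
```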
Riding a bike differs from many other modes of transportation in that the rider is usually fully exposed to the environment during the ride. Because of this, weather conditions, including the general weather situation, temperature, humidity, and wind, are all factors that may influence the number of bikes borrowed.
What weather was the most common in the dataset?
ggplot(bs, aes(x = weather))+
geom_bar() +
ggtitle("Occurrences of each weather type") +
theme(plot.title = element_text(size = 15, face = "bold"))
Milder weather types are the most common, while the harshest conditions, including snow storms and thunderstorms, are far less common in the dataset. Are these weather patterns seen evenly across all seasons?
ggplot(bs, aes(x = season,
y = ..prop.., group = 1)) +
stat_count(show.legend = F) +
facet_wrap(~factor(weather)) +
geom_text(stat = 'count',
aes(label = sprintf("%0.2f",
round(..prop.., digits = 2))),
vjust = 0,
size = 2) +
ggtitle("Proportion of each season by weather type") +
theme(plot.title = element_text(size = 15, face = "bold"))
We can see that the harshest weather (type 4) occurs only in Winter, and though its count is small, it may affect the average ridership for Winter.
Would each of these weather have an impact on the number of bikes borrowed by each user?
grid.arrange(
ggplot(bs, aes(y = casual, x = factor(weather))) +
geom_boxplot(),
ggplot(bs, aes(y = registered, x = factor(weather))) +
geom_boxplot(),
ncol = 2,
top = "Distribution of bikes borrowed by each weather type and user"
)
The number of bikes rented decreases as the weather gets harsher, which is predictable. It is, however, interesting that there are outliers under all weather conditions (except the harshest, type 4), meaning a significant number of people ride regardless of some weather changes. Would we see similar patterns for each season?
grid.arrange(
ggplot(bs, aes(y = casual, x = season)) +
geom_boxplot(),
ggplot(bs, aes(y = registered, x = season)) +
geom_boxplot(),
ncol = 2,
top = "Distribution of bikes borrowed by season and user"
)
Unsurprisingly, ridership decreases during winter, possibly due to factors such as weather conditions or temperature. It is interesting that there are high outliers in all seasons, meaning some people ride bikes regardless of the season, perhaps for transportation rather than leisure. This further confirms that registered users, who ride for transportation, are less affected by seasonal and weather factors.
While seasons and weather situations are important to look at, those are aggregated data of different days with different temperatures, humidity, and windspeed. Let’s look at the specific weather conditions.
There are two types of temperature in this dataset: the air temperature, and the apparent temperature, i.e., the temperature actually perceived by humans, accounting for other conditions such as humidity and windspeed. A natural question is therefore whether the air temperature and the apparent temperature are similar.
grid.arrange(
ggplot(bs, aes(temp_original)) +
geom_histogram(color = I('gray')) +
ggtitle("Air temperature distribution") +
scale_y_continuous(breaks = seq(0,2100,300), limits = c(0,2100)) +
theme(plot.title = element_text(size = 10, face = "bold")),
ggplot(bs, aes(atemp_original)) +
geom_histogram(color = I('gray')) +
ggtitle("Apparent temperature distribution") +
scale_y_continuous(breaks = seq(0,2100,300), limits = c(0,2100)) +
theme(plot.title = element_text(size = 10, face = "bold")),
ncol = 2
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The overall shapes are similar except for the peak around 27 degrees in atemp_original. Since apparent temperature is affected not only by air temperature but also by humidity and wind speed, this may explain why the two distributions are not identical.
summary(bs$atemp_original)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.000 5.998 15.997 15.401 24.999 50.000
summary(bs$temp_original)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.06 7.98 15.50 15.36 23.02 39.00
ggplot(bs) +
geom_boxplot(aes(x = factor('temperature'),y = temp_original)) +
geom_boxplot(aes(x = factor('app_temperature'),y = atemp_original)) +
scale_y_continuous(breaks = seq(-20,50,10)) +
ggtitle("Distribution of temperature and apparent temperature") +
theme(plot.title = element_text(size = 15, face = "bold"))
The apparent temperature is also more widely spread than the air temperature, as shown by the taller IQR box and longer whiskers. The medians, however, are similar, around 15 ~ 16 degrees. Let’s look at this further by season, where temperature should differ significantly.
# data needs to be gathered to achieve this
bs_melt <- bs %>%
select(temp_original, atemp_original,season) %>%
gather(key = temp_type, value = temp_value, -season)
head(bs_melt)
## season temp_type temp_value
## 1 Winter temp_original 3.28
## 2 Winter temp_original 2.34
## 3 Winter temp_original 2.34
## 4 Winter temp_original 3.28
## 5 Winter temp_original 3.28
## 6 Winter temp_original 3.28
# boxplot comparing temperature for each season
ggplot(bs_melt, aes(x = season, y = temp_value, fill = temp_type)) +
geom_boxplot() +
ggtitle("Distribution of temperature and apparent temperature by season") +
theme(plot.title = element_text(size = 13, face = "bold"))
It is interesting to see that the apparent temperature is often more extreme than the actual air temperature. For example, the apparent temperature is higher than the actual air temperature during summer, while it is on average lower than the actual temperature during winter.
The important question is then: does ridership vary with temperature for each type of user?
# does the ridership differ for each type of user by temperature?
grid.arrange(
ggplot(bs, aes(x = temp_original)) +
geom_point(aes(y = casual),
color = I('red'),
alpha = 0.3, position = 'jitter'),
ggplot(bs, aes(x = temp_original)) +
geom_point(aes(y = registered),
color = I('blue'),
alpha = 0.3, position = 'jitter'),
ggplot(bs, aes(x = atemp_original)) +
geom_point(aes(y = casual),
color = I('red'),
alpha = 0.3, position = 'jitter'),
ggplot(bs, aes(x = atemp_original)) +
geom_point(aes(y = registered),
color = I('blue'),
alpha = 0.3, position = 'jitter'),
ncol = 2,
top = "Bikes borrowed by each type of user and temperature vs apparent temperature"
)
Interestingly, there is no strong pattern, other than some division at certain apparent temperatures for both the casual (around 20 degrees) and the registered (around 10 and 20 degrees) users.
Let’s look at humidity and windspeed next. How are humidity and windspeed distributed in the dataset?
summary(bs$hum_original)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 48.00 63.00 62.72 78.00 100.00
summary(bs$windspeed_original)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.002 12.998 12.737 16.998 56.997
grid.arrange(
ggplot(bs, aes(hum_original)) +
geom_histogram(color = I('gray')) +
ggtitle("Humidity distribution") +
theme(plot.title = element_text(size = 12, face = "bold")),
ggplot(bs, aes(windspeed_original)) +
geom_histogram(color = I('gray')) +
ggtitle("Windspeed distribution") +
theme(plot.title = element_text(size = 12, face = "bold")),
ncol = 2
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both humidity and windspeed appear somewhat skewed, especially windspeed. According to the Beaufort wind force scale classification (https://en.wikipedia.org/wiki/Beaufort_scale), wind speeds between 20 and 28 km/h are a moderate breeze, 29 to 38 a fresh breeze, 39 to 49 a strong breeze, and 50 to 61 high wind (moderate gale). This means that on most days the windspeed was below a moderate breeze, and there were few days with winds strong enough to affect ridership.
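The Beaufort bands quoted above can be encoded with cut(); a sketch on a toy wind-speed vector (band edges in km/h, taken from the scale as quoted):

```r
# classify wind speeds (km/h) into the Beaufort bands discussed above
ws <- c(5, 22, 33, 45, 55)
cut(ws, breaks = c(-Inf, 20, 28, 38, 49, 61),
    labels = c("below moderate", "moderate breeze", "fresh breeze",
               "strong breeze", "high wind"))
```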
Do windspeed and humidity differ by season, as temperature does?
# windspeed for each season
ggplot(bs, aes(x = season, y = windspeed_original)) +
geom_boxplot() +
ggtitle("Windspeed by season") +
theme(plot.title = element_text(size = 15, face = "bold"))
# humidity for each season
ggplot(bs, aes(x = season, y = hum_original)) +
geom_boxplot() +
ggtitle("Humidity by season") +
theme(plot.title = element_text(size = 15, face = "bold"))
# ridership vs humidity and windspeed for each user type
grid.arrange(
ggplot(bs, aes(x = hum_original)) +
geom_point(aes(y = casual),
color = I('red'),
alpha = 0.3, position = 'jitter'),
ggplot(bs, aes(x = hum_original)) +
geom_point(aes(y = registered),
color = I('blue'),
alpha = 0.3, position = 'jitter'),
ggplot(bs, aes(x = windspeed_original)) +
geom_point(aes(y = casual),
color = I('red'),
alpha = 0.3, position = 'jitter'),
ggplot(bs, aes(x = windspeed_original)) +
geom_point(aes(y = registered),
color = I('blue'),
alpha = 0.3, position = 'jitter'),
ncol = 2
)
While weather conditions all affect ridership in general, the level of impact differs for each type of user. In the statistical tests, it would be interesting to follow up on this observation and see whether the differences are significant.
# two-tailed paired t-test, 95% confidence level
t.test(bs$temp_original, bs$atemp_original, alternative = 'two.sided', paired = T, mu = 0)
##
## Paired t-test
##
## data: bs$temp_original and bs$atemp_original
## t = -2.0204, df = 17378, p-value = 0.04335
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.084242583 -0.001277067
## sample estimates:
## mean of the differences
## -0.04275983
The p-value is 0.043, so at the 5% significance level there is evidence that the air temperature and the temperature perceived by humans differ.
summary(aov(data = bs, casual ~ season + factor(workingday) + season:factor(workingday)))
## Df Sum Sq Mean Sq F value Pr(>F)
## season 3 3490647 1163549 594.2 <2e-16 ***
## factor(workingday) 1 4086243 4086243 2086.9 <2e-16 ***
## season:factor(workingday) 3 655924 218641 111.7 <2e-16 ***
## Residuals 17371 34012861 1958
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(data = bs, registered ~ season + factor(workingday) + season:factor(workingday)))
## Df Sum Sq Mean Sq F value Pr(>F)
## season 3 19542008 6514003 304.194 <2e-16 ***
## factor(workingday) 1 6492104 6492104 303.171 <2e-16 ***
## season:factor(workingday) 3 96220 32073 1.498 0.213
## Residuals 17371 371982758 21414
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This confirms the hypothesis formed during the multivariate analysis. While season and working day both affect the number of bikes rented for both user types, the interaction effect between the two variables exists for casual users but not for registered users. Again, this is because registered users, who mostly ride to commute, are less affected by seasonal factors than casual users, about half of whom ride for leisure. We can also confirm the interaction effect visually:
interaction.plot(bs$season, bs$workingday, bs$casual)
interaction.plot(bs$season, bs$workingday, bs$registered)
From the interaction plot, it is clear that season affects ridership for casual users: rentals on non-working days drop more sharply in Winter than rentals on working days do. In contrast, the interaction plot for registered users shows similar (if not identical) patterns regardless of the season.
H0: There is no difference in ridership by time of day.
H1: There is some difference.
For this purpose we create a categorical variable hr_cat from hr:
1. Late Night
2. Early Morning
3. Afternoon
4. Evening/Night
bs <- bs %>%
  mutate(hr_cat = ifelse(hr >= 0 & hr <= 5, "1",
                  ifelse(hr >= 6 & hr <= 11, "2",
                  ifelse(hr >= 12 & hr <= 17, "3", "4"))))
a1 <- aov(data = bs, total_bikes ~ hr_cat)
summary(a1)
## Df Sum Sq Mean Sq F value Pr(>F)
## hr_cat 3 172231550 57410517 2497 <2e-16 ***
## Residuals 17375 399530041 22995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the above result (p-value < 0.05), at the 95% confidence level we reject H0 and conclude that there is some difference.
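The hr_cat recoding above could equivalently use cut(), which avoids the nested ifelse(); a sketch on a toy hour vector:

```r
# bucket hour-of-day (0-23) into the four periods used above
# hours 0-5 -> "1", 6-11 -> "2", 12-17 -> "3", 18-23 -> "4"
hr <- c(0, 5, 6, 11, 12, 17, 18, 23)
cut(hr, breaks = c(-1, 5, 11, 17, 23), labels = c("1", "2", "3", "4"))
```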
Now let’s find out when ridership is the highest.
a2 <- TukeyHSD(a1)
print(a2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total_bikes ~ hr_cat, data = bs)
##
## $hr_cat
## diff lwr upr p adj
## 2-1 183.19213 174.806839 191.57742 0e+00
## 3-1 270.57533 262.197157 278.95350 0e+00
## 4-1 200.84900 192.467509 209.23048 0e+00
## 3-2 87.38320 79.045944 95.72045 0e+00
## 4-2 17.65687 9.316279 25.99745 4e-07
## 4-3 -69.72633 -78.059760 -61.39290 0e+00
plot(a2)
Looking at the chart above we can rank the demand: Afternoon > Evening/Night > Early Morning > Late Night.
1. Ridership is highest in the Afternoon.
2. Ridership is lowest in the Late Night period.
Let’s define three types of days:
1. Holiday
2. Working day
3. Weekend
H0: There is no difference in ridership by type of day.
H1: There is some difference.
bs <- bs %>%
  mutate(typeofday = ifelse(holiday == 0 & workingday == 0, "Weekend",
                     ifelse(holiday == 1, "Holiday", "Working Day")))
b1 <- aov(data = bs, total_bikes ~ typeofday)
summary(b1)
## Df Sum Sq Mean Sq F value Pr(>F)
## typeofday 2 855393 427697 13.02 2.24e-06 ***
## Residuals 17376 570906198 32856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is less than 0.05, we can reject H0 and conclude that ridership differs across types of days.
Now let’s find out on which types of days ridership is higher.
b2 <- TukeyHSD(b1)
print(b2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total_bikes ~ typeofday, data = bs)
##
## $typeofday
## diff lwr upr p adj
## Weekend-Holiday 26.98201 7.056793 46.90724 0.0042972
## Working Day-Holiday 36.33775 16.941175 55.73433 0.0000338
## Working Day-Weekend 9.35574 2.199345 16.51213 0.0061887
plot(b2)
Conclusion: from the table and plot above, Ridership on Working Day > Ridership on Weekend > Ridership on Holiday.
Let’s see if the type of weather has an impact on ridership.
H0: The type of weather has no impact on ridership.
H1: There is some impact.
# weather level 4 has only 3 observations, so it is excluded here
c1 <- aov(data = bs, total_bikes ~ factor(weather, levels = c(1, 2, 3)))
summary(c1)
## Df Sum Sq Mean Sq F value
## factor(weather, levels = c(1, 2, 3)) 2 12245259 6122629 190.1
## Residuals 17373 559464416 32203
## Pr(>F)
## factor(weather, levels = c(1, 2, 3)) <2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 3 observations deleted due to missingness
Since the p-value is <0.05 we can conclude that ridership depends on the weather. Now let's find out how the weather affects ridership.
c2 = TukeyHSD(c1)
print(c2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total_bikes ~ factor(weather, levels = c(1, 2, 3)), data = bs)
##
## $`factor(weather, levels = c(1, 2, 3))`
## diff lwr upr p adj
## 2-1 -29.70378 -37.08188 -22.32567 0
## 3-1 -93.28999 -105.12979 -81.45019 0
## 3-2 -63.58621 -76.37738 -50.79504 0
plot(c2)
Looking at the above data we can conclude that: \(Ridership in Clear weather > Mist > Light Snow/Rain\)
Can we use the variables we have explored to predict how many bikes will be borrowed in a given hour on a given day?
Now let's validate our two user models using k-fold cross-validation.
library(caret)  # provides createDataPartition(), trainControl() and train()
trainingFold1 = createDataPartition(bs$casual, p = 0.8)
training1 = bs[trainingFold1$Resample1, ]
testing1 = bs[-trainingFold1$Resample1, ]
trainMethod = trainControl(method = "cv", number = 5, returnData = TRUE, returnResamp = "all")
model_casual = train(data = training1, casual ~ factor(hr) + factor(weather) + season + windspeed_original + factor(workingday), method = 'lm', trControl = trainMethod)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
summary(model_casual)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.469 -20.811 -3.579 13.816 286.757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.81512 1.63537 20.066 < 2e-16 ***
## `factor(hr)1` -4.33155 2.01378 -2.151 0.031497 *
## `factor(hr)2` -5.83874 2.03397 -2.871 0.004103 **
## `factor(hr)3` -8.87002 2.03148 -4.366 1.27e-05 ***
## `factor(hr)4` -10.03819 2.03798 -4.926 8.51e-07 ***
## `factor(hr)5` -9.28077 2.02497 -4.583 4.62e-06 ***
## `factor(hr)6` -5.84334 2.03876 -2.866 0.004161 **
## `factor(hr)7` 0.57100 2.00137 0.285 0.775416
## `factor(hr)8` 12.17885 2.01251 6.052 1.47e-09 ***
## `factor(hr)9` 20.56329 2.01551 10.203 < 2e-16 ***
## `factor(hr)10` 36.54202 2.02824 18.017 < 2e-16 ***
## `factor(hr)11` 51.09415 2.02132 25.278 < 2e-16 ***
## `factor(hr)12` 57.82189 2.03322 28.439 < 2e-16 ***
## `factor(hr)13` 63.50282 2.02947 31.290 < 2e-16 ***
## `factor(hr)14` 66.50125 2.01560 32.993 < 2e-16 ***
## `factor(hr)15` 65.60599 2.01089 32.625 < 2e-16 ***
## `factor(hr)16` 64.58776 2.03656 31.714 < 2e-16 ***
## `factor(hr)17` 64.23472 2.03286 31.598 < 2e-16 ***
## `factor(hr)18` 51.01490 2.01311 25.341 < 2e-16 ***
## `factor(hr)19` 37.41676 2.02845 18.446 < 2e-16 ***
## `factor(hr)20` 26.26593 2.01315 13.047 < 2e-16 ***
## `factor(hr)21` 18.04512 2.01034 8.976 < 2e-16 ***
## `factor(hr)22` 11.95446 2.01748 5.925 3.19e-09 ***
## `factor(hr)23` 5.50100 2.00116 2.749 0.005987 **
## `factor(weather)2` -6.88966 0.68127 -10.113 < 2e-16 ***
## `factor(weather)3` -21.34174 1.07918 -19.776 < 2e-16 ***
## `factor(weather)4` -26.58351 19.88001 -1.337 0.181180
## seasonSpring 16.08865 0.83336 19.306 < 2e-16 ***
## seasonSummer 18.66586 0.82543 22.613 < 2e-16 ***
## seasonWinter -17.16027 0.84456 -20.318 < 2e-16 ***
## windspeed_original -0.14224 0.03756 -3.787 0.000153 ***
## `factor(workingday)1` -32.32637 0.62885 -51.406 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.39 on 13873 degrees of freedom
## Multiple R-squared: 0.5117, Adjusted R-squared: 0.5106
## F-statistic: 468.9 on 31 and 13873 DF, p-value: < 2.2e-16
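The 20% hold-out split (`testing1`) created above is never actually scored. A quick sketch of out-of-sample evaluation for the casual-user model (assumes `caret` is attached, as its functions are used above):

```r
# Score the casual-user model on the held-out 20% split
pred_casual = predict(model_casual, newdata = testing1)
postResample(pred = pred_casual, obs = testing1$casual)  # reports RMSE, Rsquared, MAE
```

Comparing the hold-out R² against the in-sample 0.51 gives a rough check for overfitting.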
trainingFold2 = createDataPartition(bs$registered, p = 0.8)
training2 = bs[trainingFold2$Resample1, ]
testing2 = bs[-trainingFold2$Resample1, ]
trainMethod = trainControl(method = "cv", number = 5, returnData = TRUE, returnResamp = "all")
model_registered = train(data = training2, registered ~ season + factor(hr) + factor(workingday) + factor(weather), method = 'lm', trControl = trainMethod)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
summary(model_registered)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -341.80 -52.89 -9.46 43.74 474.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.133 4.390 8.687 < 2e-16 ***
## seasonSpring -6.570 2.304 -2.851 0.00437 **
## seasonSummer 14.520 2.307 6.295 3.17e-10 ***
## seasonWinter -73.037 2.329 -31.364 < 2e-16 ***
## `factor(hr)1` -16.913 5.615 -3.012 0.00260 **
## `factor(hr)2` -27.490 5.612 -4.899 9.76e-07 ***
## `factor(hr)3` -36.198 5.647 -6.410 1.50e-10 ***
## `factor(hr)4` -39.322 5.673 -6.932 4.34e-12 ***
## `factor(hr)5` -26.072 5.619 -4.640 3.52e-06 ***
## `factor(hr)6` 30.236 5.607 5.392 7.06e-08 ***
## `factor(hr)7` 160.656 5.600 28.686 < 2e-16 ***
## `factor(hr)8` 290.191 5.564 52.151 < 2e-16 ***
## `factor(hr)9` 146.081 5.608 26.048 < 2e-16 ***
## `factor(hr)10` 85.672 5.630 15.217 < 2e-16 ***
## `factor(hr)11` 105.193 5.598 18.791 < 2e-16 ***
## `factor(hr)12` 143.746 5.622 25.568 < 2e-16 ***
## `factor(hr)13` 139.522 5.599 24.918 < 2e-16 ***
## `factor(hr)14` 124.242 5.606 22.160 < 2e-16 ***
## `factor(hr)15` 135.117 5.602 24.119 < 2e-16 ***
## `factor(hr)16` 195.203 5.601 34.850 < 2e-16 ***
## `factor(hr)17` 343.669 5.597 61.406 < 2e-16 ***
## `factor(hr)18` 315.973 5.603 56.390 < 2e-16 ***
## `factor(hr)19` 215.494 5.550 38.827 < 2e-16 ***
## `factor(hr)20` 149.539 5.559 26.902 < 2e-16 ***
## `factor(hr)21` 101.574 5.581 18.201 < 2e-16 ***
## `factor(hr)22` 65.365 5.612 11.648 < 2e-16 ***
## `factor(hr)23` 30.888 5.630 5.487 4.17e-08 ***
## `factor(workingday)1` 43.248 1.750 24.719 < 2e-16 ***
## `factor(weather)2` -10.890 1.903 -5.722 1.08e-08 ***
## `factor(weather)3` -75.927 2.999 -25.316 < 2e-16 ***
## `factor(weather)4` -87.015 55.446 -1.569 0.11658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95.91 on 13874 degrees of freedom
## Multiple R-squared: 0.5947, Adjusted R-squared: 0.5938
## F-statistic: 678.5 on 30 and 13874 DF, p-value: < 2.2e-16
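The registered-user model can be scored on its own hold-out split (`testing2`) the same way (a sketch, assuming `caret` is attached):

```r
# Score the registered-user model on the held-out 20% split
pred_registered = predict(model_registered, newdata = testing2)
postResample(pred = pred_registered, obs = testing2$registered)  # RMSE, Rsquared, MAE
```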
Again, the R² for the two models is still only about 51% and 59%, meaning that we may need more variables to explain the total number of bikes better.
While we have gained some interesting insights from our dataset, we weren't able to build a satisfying model despite our efforts.
We believe there are several ways in which this prediction model could be improved: * Try other predictive algorithms: this dataset may not be well suited to linear regression, which may be why the predictive results were not satisfying. * More data on customers: each row in this dataset is actually an aggregate of customers for each hour, which means customer-level data are lacking, especially at the individual level. It would be nice to have more information on individual customers (e.g. age, gender, nationality) that might improve predictive power, as these personal variables may also have an impact on usage patterns. Currently, the only thing we know about the customers themselves is whether they are registered or casual, and assuming all of them borrow bikes for similar reasons and in similar patterns would be a big mistake.
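As a first step toward the "other predictive algorithms" idea, a random forest could be tried with the `randomForest` package already loaded above. A minimal sketch, assuming the `bs` columns and the `training1` split defined earlier; `ntree = 200` is an illustrative choice, not a tuned value:

```r
# Sketch: random forest on the same predictors used in the linear models above
set.seed(42)
rf_total = randomForest(total_bikes ~ factor(hr) + factor(weather) + season +
                          windspeed_original + factor(workingday),
                        data = training1, ntree = 200, importance = TRUE)
rf_total              # the out-of-bag error gives a quick generalization estimate
varImpPlot(rf_total)  # shows which predictors the forest relies on most
```

Since a forest can capture the non-linear hour-of-day and weather effects seen in the ANOVA results without hand-coded dummies, it is a natural next candidate.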